Vocabulary pruning


Prune or Retrain: Optimizing the Vocabulary of Multilingual Models for Estonian

Dorkin, Aleksei, Purason, Taido, Sirts, Kairit

arXiv.org Artificial Intelligence

Adapting multilingual language models to specific languages can enhance both their efficiency and performance. In this study, we explore how modifying the vocabulary of a multilingual encoder model to better suit the Estonian language affects its downstream performance on the Named Entity Recognition (NER) task. The motivations for adjusting the vocabulary are twofold: practical benefits affecting the computational cost, such as reducing the input sequence length and the model size, and performance enhancements by tailoring the vocabulary to the particular language. We evaluate the effectiveness of two vocabulary adaptation approaches -- retraining the tokenizer and pruning unused tokens -- and assess their impact on the model's performance, particularly after continual training. While retraining the tokenizer degraded performance on the NER task, suggesting that longer embedding tuning might be needed, we observed no negative effects from pruning.
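The pruning approach described in the abstract amounts to dropping vocabulary entries that never occur in the target-language corpus while keeping the surviving tokens' embedding rows intact, so no retraining is strictly required. A minimal sketch of that idea is below; the function name, the `protected` special-token list, and the dict-based vocabulary are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def prune_vocabulary(vocab, embeddings, corpus_tokens,
                     protected=("[PAD]", "[UNK]", "[CLS]", "[SEP]")):
    """Keep only tokens observed in the corpus (plus special tokens),
    slicing the embedding matrix to match the reduced vocabulary.

    vocab:        dict mapping token string -> row index in `embeddings`
    embeddings:   (V, d) array of token embeddings
    corpus_tokens: iterable of tokens seen in the target-language corpus
    """
    used = set(corpus_tokens) | set(protected)
    # Sort by original index so embedding rows stay aligned with token ids.
    kept = [(tok, idx) for tok, idx in sorted(vocab.items(), key=lambda kv: kv[1])
            if tok in used]
    new_vocab = {tok: new_idx for new_idx, (tok, _) in enumerate(kept)}
    old_rows = [idx for _, idx in kept]
    # Pruned model reuses the original weights for surviving tokens unchanged.
    new_embeddings = embeddings[old_rows]
    return new_vocab, new_embeddings
```

Because surviving tokens keep their original embedding vectors, this shrinks the model's embedding matrix (often the largest parameter block in a multilingual encoder) without altering how retained inputs are represented, which is consistent with the abstract's finding that pruning did not hurt downstream NER.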


Analysing the Impact of Removing Infrequent Words on Topic Quality in LDA Models

Bystrov, Victor, Naboka-Krell, Viktoriia, Staszewska-Bystrova, Anna, Winker, Peter

arXiv.org Artificial Intelligence

The use of topic modelling techniques, especially Latent Dirichlet Allocation (LDA) introduced by Blei et al. (2003), is growing fast, and the methods find application in a broad variety of domains. In text-as-data applications, LDA enables the unsupervised analysis of large collections of text by uncovering latent structures behind the data. Given this increasing use of LDA as a standard tool for empirical analysis, interest in the details of the method, and in particular in parameter settings for its implementation, is also rising. Since the introduction of LDA by Blei et al. (2003), several of its methodological components have already been studied in more detail, for example the choice of the number of topics (Cao et al., 2009; Mimno et al., 2011; Lewis and Grossetti, 2022; Bystrov et al., 2022a), hyper-parameter settings (Wallach et al., 2009), and model design (e.g.
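The pre-processing step this paper studies, removing infrequent words before fitting LDA, is typically implemented as a document-frequency filter over the tokenized corpus. A minimal sketch is below; the function name and the `min_df` threshold are illustrative assumptions, not the paper's specific setup (libraries such as gensim offer the same filtering via `Dictionary.filter_extremes`).

```python
from collections import Counter

def filter_infrequent(docs, min_df=2):
    """Drop words that appear in fewer than `min_df` documents.

    docs: list of tokenized documents (each a list of word strings)
    Returns the corpus with infrequent words removed, ready for LDA.
    """
    # Document frequency: count each word once per document it occurs in.
    df = Counter(word for doc in docs for word in set(doc))
    keep = {word for word, count in df.items() if count >= min_df}
    return [[word for word in doc if word in keep] for doc in docs]
```

The choice of `min_df` is exactly the kind of parameter setting whose effect on topic quality the abstract refers to: too low keeps noisy rare words in the vocabulary, too high discards terms that may anchor coherent topics.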